High-Throughput Looping in Stream Processors Using Self-Timed Architectures
Authors
Abstract
Special-purpose stream-processing systems are becoming increasingly popular due to the demands of multimedia applications, digital signal processing, and cryptography. In stream-processing systems, algorithms are typically implemented as a series of hardware stages. Each hardware stage receives the data stream from the previous stage and, after processing, sends the new data stream to the next stage. The focus of this work is on application-specific stream processors, rather than general-purpose programmable devices.
The presence of conditional constructs (e.g., conditional branches and loops) in an algorithm poses a significant challenge to creating efficient stream processors. This difficulty is exacerbated if the control constructs are data-dependent, e.g., a loop with a variable iteration count. Past work by Kapasi, Dally et al. [2] addressed the problems related to conditional branches, but not the critical challenge of pipelining loops. Our work addresses this key challenge by providing a comprehensive approach to handling various types of loop constructs, thereby extending the set of algorithms that can be efficiently implemented as stream processors.
We target not only fixed-iteration “for” loops, but also variable-iteration (e.g., data-dependent) “while” loops. Furthermore, our approach extends to nested as well as unrolled loops. Our method applies to algorithms that iterate several times over the same data set (e.g., iterative ODE solvers). Our strategy is to allow the body of the loop to operate on multiple data sets concurrently. This requires extra storage so that each data set can maintain a distinct copy of its algorithmic state. A key contribution is the design of an efficient loop control block that supervises loop activity and ensures that the loop operates at maximum achievable throughput, i.e., is neither sparsely occupied nor congested.
Our implementation architecture is based on asynchronous or self-timed pipelines and rings, which offer several benefits for the design of stream processors. Self-timed circuits are more energy-efficient because they consume little energy when idle. A self-timed approach can also reduce design effort considerably through greater design modularity.
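The loop strategy described in the abstract (several data sets circulating through one loop body, each with its own state, under a control block that decides when to admit and release them) can be illustrated with a small software model. The sketch below is not the authors' design. It is a clocked, step-by-step approximation of a ring, not a self-timed circuit with handshakes. The names `Token`, `collatz_body`, `run_loop_ring`, `num_slots`, and `max_tokens` are hypothetical, and the Collatz iteration is only a stand-in for a data-dependent "while" loop over positive integers.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    """One data set in flight, carrying its own private loop state."""
    tag: int             # identifies which input this data set came from
    value: int           # current algorithmic state of the loop
    iterations: int = 0  # number of loop-body executions so far


def collatz_body(value: int) -> int:
    """Stand-in loop body: one Collatz step (assumed example workload)."""
    return value // 2 if value % 2 == 0 else 3 * value + 1


def run_loop_ring(inputs, num_slots=4, max_tokens=3):
    """Simulate a ring of `num_slots` slots shared by several data sets.

    Slot 0 plays the role of the loop control block. At every step it
    (1) releases a token whose exit condition holds, (2) otherwise runs
    the loop body once on the token, or (3) admits a pending input if
    the slot is empty and fewer than `max_tokens` tokens are in the ring.
    Inputs must be positive integers. Returns (results, steps), where
    results maps each input index to its iteration count.
    """
    pending = deque(enumerate(inputs))
    ring = deque([None] * num_slots)
    results = {}
    steps = 0
    while pending or any(slot is not None for slot in ring):
        token = ring[0]
        # Exit test of the "while" loop: value == 1 means done.
        if token is not None and token.value == 1:
            results[token.tag] = token.iterations
            ring[0] = None
            token = None
        if token is not None:
            # Loop continues: execute the body once and recirculate.
            token.value = collatz_body(token.value)
            token.iterations += 1
        else:
            # Empty slot: admit new data unless the ring would be congested.
            occupancy = sum(slot is not None for slot in ring)
            if pending and occupancy < max_tokens:
                tag, value = pending.popleft()
                ring[0] = Token(tag, value)
        ring.rotate(-1)  # every token advances by one slot
        steps += 1
    return results, steps


if __name__ == "__main__":
    results, steps = run_loop_ring([6, 1, 7, 3])
    print(results, steps)
```

In this model, setting `max_tokens=1` forces the data sets through the loop one at a time, while larger values let them overlap, which is the effect the abstract attributes to keeping the loop neither sparsely occupied nor congested. Results are keyed by `tag` because data sets with fewer iterations can leave the ring before earlier ones.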
Publication date: 2006